PERCEIVED ITEM-DIFFICULTY IN THREE TESTS OF INTELLECTUAL
PERFORMANCE CAPACITY



Bratfisch, O., Borg, G., and Dornic, S. Perceived
item-difficulty in three tests of intellectual performance
capacity. Reports from the Institute of Applied Psychology,
the University of Stockholm, 1972, No. 29. - Three
tests of intellectual performance capacity referring to
factors V, S, and R according to Thurstone's system of
primary mental abilities were administered to a total
number of 34 subjects. Immediately after finishing an
individual item subjects were asked to estimate the
perceived difficulty of that item. The ratings were to
be given on a symmetric scale with 9 categories with
verbal expression labels. A high correlation between
the rank order of items according to estimated difficulty
and the real item sequence was obtained in all three
tests used (r ≥ 0.92). A linear relationship was found
between estimated difficulty and standard scores
corresponding to solution frequencies. A close correspondence
was noticed between the widths and the levels of the
ranges of the estimates on the one hand and the
corresponding widths and levels of the standard score ranges
on the other hand. Subjects who could solve an item
correctly tended to estimate the difficulty of that item
as lower than subjects who could not.



Introduction

A series of investigations at our institute has been concerned
with "subjective" difficulty of various human activities, i.e. with
difficulty as perceived by the performing person himself. One of
our starting points was the reflection that perceived difficulty of
any given activity rather than corresponding "objective"
measurements would be decisive for a person's feelings, attitudes,
motivation, etc., concerning that activity.

Researchers do not appear to have paid much attention to this
problem area earlier. Guilford and Cotzin reported a study on
perceived (felt) difficulty of judgement tasks in 1941, but relatively
little work seems to have been done in this field in the following two
decades. Interest in the measurement of perceived difficulty was,
however, taken up by Borg (1961), whose study was followed by a series
of investigations on the topic covering physical performance (e.g.
Borg, 1962), motor skill (Bratfisch, Dornic & Borg, 1970), immediate
memory (Borg, Bratfisch & Dornic, 1971a), visual search tasks
(Borg, Bratfisch & Dornic, 1971b), and a test of reasoning ability
(Bratfisch, Dornic & Borg, 1972). Theoretical, methodological
and applied problems in connection with perceived difficulty have been
discussed as well (Borg, Bratfisch & Dornic, 1971c).

* This investigation was supported by research grants from the

The present study falls within the research program outlined above
and concerns the perceived difficulty of three different kinds of
intellectual tasks. Our main interest was in two questions: (a) how is
perceived difficulty related to measurements based on performance,
and (b) are there any differences between the estimates of difficulty
of the three kinds of intellectual tasks?



The experiments 



Items in three factor tests from a standardized intelligence battery
(Delta Battery, 1969) regularly applied in connection with vocational
guidance were used as stimuli in the present experiments. The tests
refer to factors reasoning ability (R), spatial ability (S), and verbal
comprehension (V) in accordance with the system introduced by
Thurstone and Thurstone (1941) as indicated below:

(1) "Number series" (R)

    1  2  4  8  16  32  _  _

    Find the rule by which the numbers are arranged and
    fill in the two numbers that are missing.

(2) "Levers" (S)

    [drawing of a lever system with an arrow]

    Indicate in which direction the end of the lever will move if
    you move the upper lever bar in the direction of the arrow.

(3) "Synonyms" (V)

    BOY    girl   man   lad   guy
    (translated from Swedish)

    Underline one of the four words in small letters that means
    about the same as the word in capitals to the left.



The selection of tests was guided by the endeavour to cover
intelligence factors which have been shown to account for a large
part of general ability (e.g. Härnqvist, 1960). Thus a greater
variety of intellectual activities was also covered than in
earlier investigations on perceived difficulty, where mainly tests of
reasoning ability were applied (e.g. Borg & Forsling, 1964; Borg,
1969).

Methods and experimental conditions 

The three tests were administered to the subjects under standard
conditions. During the performance part of the test the subjects were
asked to estimate the perceived difficulty of the individual items
immediately after they had finished an item. No feedback about the
correctness of a solution was given. Difficulty was to
be estimated even when an item remained unsolved, i.e. even when
a subject after having tried to solve an item "gave up". The estimates
were to be given on a nine-grade symmetrical category scale. The
categories were assigned verbal labels as follows: 1 - very, very
easy; 2 - very easy; 3 - easy; 4 - rather easy; 5 - neither easy
nor difficult; 6 - rather difficult; 7 - difficult; 8 - very difficult;
9 - very, very difficult. The scale was presented prior to the testing
session and the subjects were carefully instructed to base the estimates
on their immediate experience of difficulty regardless of the item
sequence, as item sequence was not necessarily arranged according
to "objective" difficulty. The experiments were carried out either
individually or in small groups of 3 to 4 subjects and in one session.

Means and standard deviations of the experimental estimates,
together with z-values corresponding to the solution frequencies from
a group of 100 vocational guidance clients of our institute, are shown
in Tables 1, 2, and 3 for the tests "Number series", "Levers",
and "Synonyms", respectively.
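The standard scores in the tables can be related to solution frequencies through the inverse of the normal distribution function. The following is a minimal sketch only. The function name, the sign convention (easy items receive negative z, which agrees with the tables), and the clipping of frequencies to the interval 0.01 to 0.99 (suggested by the recurring extreme values of ±2.326) are assumptions made for illustration, not details stated in the report.

```python
from statistics import NormalDist


def z_from_solution_frequency(p_solved, floor=0.01):
    """Standard score of an item solved by the proportion p_solved.

    Assumption: proportions are clipped to [floor, 1 - floor] so that
    items solved by everybody or by nobody still get a finite score.
    """
    p = min(max(p_solved, floor), 1.0 - floor)
    # Use the proportion of subjects *failing* the item, so that more
    # difficult items receive higher z-values.
    return NormalDist().inv_cdf(1.0 - p)


if __name__ == "__main__":
    for p in (0.99, 0.82, 0.50, 0.01):
        print(p, round(z_from_solution_frequency(p), 3))
```

With this convention an item solved by 82 of 100 persons is placed at about z = -0.915, a value that also occurs in the tables.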

Table 1 Means (M) and standard deviations (s) of the experimental
estimates in the test "Number series" as well as standard
scores (z) corresponding to solution frequencies obtained
from another group of 100 persons.

Item    M      s      z         Item    M      s      z

  1    1.85   0.86   -1.405      13    6.19   1.44   +0.306
  2    1.48   0.58   -2.326      14    5.15   1.43   +0.332
  3    2.63   1.11   -1.341      15    5.96   1.53   -0.050
  4    3.44   1.25   -0.643      16    7.11   1.16   +1.175
  5    3.44   1.09   -0.332      17    6.67   1.00   +0.954
  6    4.48   1.50   -0.706      18    6.26   1.58   +0.306
  7    5.33   1.49   +0.176      19    7.63   1.08   +1.555
  8    4.30   1.07   -0.878      20    7.11   1.28   +1.282
  9    4.93   1.30   -0.842      21    7.33   1.42   +2.326
 10    5.00   1.24   -0.306      22    7.70   1.20   +2.326
 11    5.56   1.34   -0.253      23    7.67   1.33   +1.341
 12    4.67   1.33   -0.675      24    8.15   0.86   +2.054
                                 25    8.78   0.51   +2.326

Table 2 Means (M) and standard deviations (s) of the experimental
estimates in the test "Levers" as well as standard scores
(z) corresponding to solution frequencies obtained from another
group of 100 persons.

Item    M      s      z             Item    M      s      z

  1    4.24   1.00   -0.915          13    [illegible]
  2    4.33   1.06   -0.878          14    [illegible]
  3    4.86   1.39   -0.27?          15    [illegible]
  4    4.95   1.53   -0.77?          16    [illegible]
  5    4.76   1.14   [illegible]     17    [illegible]
  6    5.05   1.20   -0.050          18    6.29   1.23   -0.100
  7    4.76   0.94   -0.739          19    6.19   1.21   +0.126
  8    4.86   1.42   -0.739          20    5.86   1.42   -0.050
  9    5.38   1.24   -0.468          21    6.81   1.26   +1.126
 10    5.38   1.43   -0.100          22    6.90   1.58   +0.332
 11    6.14   1.01   +0.332          23    7.14   1.53   +0.202
 12    5.33   1.15   -0.202          24    7.29   1.59   +0.279


Table 3 Means (M) and standard deviations (s) of the experimental
estimates in the test "Synonyms" as well as standard scores
(z) corresponding to solution frequencies obtained from another
group of 100 persons.

Item    M      s      z         Item    M      s      z

  1    1.33   0.56   -2.326      16    4.33   1.69   +0.176
  2    1.91   1.10   -1.751      17    3.04   1.46   -1.476
  3    1.58   0.50   -1.751      18    3.92   1.74   -0.995
  4    2.83   1.34   -0.995      19    4.00   1.29   -0.995
  5    2.79   1.06   -2.326      20    5.00   1.74   -0.583
  6    3.04   1.34   -1.645      21    5.17   1.55   +0.228
  7    3.29   1.20   -1.405      22    5.17   1.49   +0.151
  8    4.08   1.41   -1.881      23    4.75   1.36   +0.643
  9    3.62   1.41   -0.613      24    5.88   1.54   +0.413
 10    3.79   1.86   -0.806      25    6.17   1.55   +1.341
 11    3.75   1.82   +0.075      26    5.83   1.95   +1.341
 12    4.46   1.32   -0.842      27    6.04   1.63   +0.413
 13    3.92   1.64   -1.751      28    5.75   1.80   +0.279
 14    3.87   1.54   -0.496      29    6.71   2.12   +0.100
 15    4.83   1.79   -0.440      30    7.13   1.65   +1.341


Subjects 



Altogether 34 subjects with high school education participated in
the experiments. Twenty-seven of them participated in the experiment
with the test "Number series", 21 in the one with the test "Levers",
and 24 in the one with the test "Synonyms". Eight subjects were
involved in all the experiments, 22 in two, and 4 in only one. The
proportion of males to females was in all three experimental groups
about 1:1. The age of the subjects ranged from 20 to 37 years with
a median age of 23.



Results and discussion

Inter-individual variation in estimated difficulty

A number of investigations have shown that the inter-individual
variation in estimates depends largely on the scaling method
applied (see e.g. Ekman & Künnapas, 1969; Bratfisch, Dornic &
Borg, 1972). An analysis of the relation between means and the
corresponding standard deviations can thus be regarded as an aspect
of the reliability of the estimates when no other information (e.g.
repeated estimates of the stimuli) is available.

In Figure 1A the standard deviations of the experimental estimates
of each item in the test "Number series" are plotted against the
corresponding means.* Figures 1B and 1C show the same kind of
data for the tests "Levers" and "Synonyms", respectively.

An inverted U-shaped trend roughly describes the relation
between standard deviations and mean estimates in Figure 1A, while
the standard deviations grow linearly with increasing means
in Figure 1C. The data in Figure 1B show a considerable scatter,
but for the sake of parsimony a linear function can be said to
describe the trend approximately here as well. An inverted
U-shaped relation between standard deviations and means is to be
expected when the frequency distributions of the extreme stimuli on
a scale with defined upper and lower boundaries are truncated by
not permitting the subjects to place any stimuli below or above the
boundaries. This was the case with "Number series" (Fig. 1A), but
no such end effects were noticed with regard to "Levers" (Fig. 1B) and
only in single cases with "Synonyms" (Fig. 1C). On the contrary, the
means of the estimates concerning "Levers" cover categories 4 to 7
only, and those of "Synonyms" mainly categories 1 to 7, and no
marked skewness of the individual item distributions was noticed
in either of these two tests with the exception of the first two stimuli
of the test "Synonyms". With this kind of data the relation between
standard deviations and means is expected to be by and large linear.
In view of the reasoning above, the reliability of the data obtained in
the three experiments can be regarded as satisfactory.
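The end effect described above can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the authors' analysis: it assumes that each rating arises from a normally distributed "latent" judgement with constant spread, which is rounded to the nearest category and forced into the boundary category whenever it falls outside the scale. The function name and the chosen spread of 1.3 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def rating_sd(latent_mean, latent_sd, lowest=1, highest=9):
    """Standard deviation of category ratings on a bounded scale.

    Assumption: a normal latent judgement is rounded to the nearest
    category; everything below `lowest` or above `highest` is placed
    in the boundary category.
    """
    dist = NormalDist(latent_mean, latent_sd)
    probs = {}
    for k in range(lowest, highest + 1):
        below = 0.0 if k == lowest else dist.cdf(k - 0.5)
        above = 1.0 if k == highest else dist.cdf(k + 0.5)
        probs[k] = above - below
    mean = sum(k * p for k, p in probs.items())
    variance = sum((k - mean) ** 2 * p for k, p in probs.items())
    return sqrt(variance)


if __name__ == "__main__":
    # The spread of the ratings shrinks towards both ends of the scale.
    for m in (1, 3, 5, 7, 9):
        print(m, round(rating_sd(m, 1.3), 2))
```

Under these assumptions the standard deviation is largest in the middle of the scale and smallest at the two boundaries, i.e. the inverted U-shaped pattern seen for "Number series".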



* When the frequency distributions of the items of the three tests
were analyzed, they did not appear to be skewed or truncated (with but
a few exceptions concerning "Number series" and "Synonyms"), and it was
thus decided to present means and standard deviations, though medians
and semi-interquartile ranges were computed as well. No considerable
difference, however, could be noticed between the two kinds of
statistics.
Fig. 1 Standard deviations plotted against arithmetic means of
estimates. Diagram A shows data from "Number series",
Diagram B those from "Levers", and Diagram C those
from "Synonyms". The regression lines in Diagrams B and
C were fitted mathematically; the curve drawn in Diagram A
was fitted by eye.

Estimated and "objective" difficulty 

Figure 2 shows means of the experimental estimates of difficulty
plotted against the order of items in the tests. Diagram A represents
the data from "Number series", Diagram B those from "Levers",
and Diagram C those from "Synonyms". The close relationship between
the sets of data is quite evident in all three graphs and is numerically
confirmed by Spearman coefficients of rank-order correlation of
0.97, 0.92 and 0.92 for the data in Diagrams A, B, and C,
respectively.
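The coefficient for "Number series" can be checked against the means printed in Table 1. The sketch below is an illustration, not the authors' computation: it assumes that tied means receive the average of their ranks and that the coefficient is obtained as the product-moment correlation of the two rank series. All function names are hypothetical.

```python
from math import sqrt

# Mean estimates for items 1 to 25 of "Number series" (Table 1).
NUMBER_SERIES_MEANS = [
    1.85, 1.48, 2.63, 3.44, 3.44, 4.48, 5.33, 4.30, 4.93, 5.00,
    5.56, 4.67, 6.19, 5.15, 5.96, 7.11, 6.67, 6.26, 7.63, 7.11,
    7.33, 7.70, 7.67, 8.15, 8.78,
]


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    item_order = list(range(1, len(NUMBER_SERIES_MEANS) + 1))
    print(round(spearman(item_order, NUMBER_SERIES_MEANS), 2))
```

Applied to the Table 1 means, this procedure should give a value close to the 0.97 reported for Diagram A.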
Fig. 2 Means of estimates plotted against the real order of items in
the tests. Diagram A shows the data from "Number series",
Diagram B those from "Levers", and Diagram C those from
"Synonyms".

It could be argued that these close relationships are due to the
fact that items were to be estimated in the order of occurrence in the
test. However, similar investigations (Borg & Forsling, 1964; Borg,
1969; Bratfisch et al., 1972), where items were presented randomly
and not according to item sequence in the test, showed rank-order
correlation coefficients of about 0.90 between real item sequence in
the test and rank order according to estimates of difficulty. That the
coefficients of correlation are only slightly higher in the present
investigation probably means that the order in which items are
presented has but negligible effects on the estimates.

Figure 3 shows means of perceived difficulty as a function of
standard scores (z-values), corresponding to the solution frequencies
(p) of a group of 100 vocational guidance clients with the same level
of education as the experimental groups (cf. Tables 1, 2, and 3). The
category scale applied is thus treated as a scale with equal intervals.
The z-values in Figure 3 correspond to solution frequencies from
another group of 100 persons rather than to those of the three
experimental groups, because the standard error of a
p-index also depends on the size of the sample, which is
rather small in the experimental groups. Diagram A in Figure 3
again represents the data from "Number series", Diagram B those
from "Levers", and Diagram C those from "Synonyms".
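The dependence of the standard error of a solution frequency on sample size can be made concrete with the usual binomial formula, SE = sqrt(p(1 - p)/n). The sketch below is only an illustration; the function name is hypothetical, and p = 0.5 is chosen because it gives the largest standard error for a given n.

```python
from math import sqrt


def standard_error_of_proportion(p, n):
    """Binomial standard error of a proportion p observed in n persons."""
    return sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Sizes of the three experimental groups and of the reference group.
    for n in (21, 24, 27, 100):
        print(n, round(standard_error_of_proportion(0.5, n), 3))
```

For p = 0.5 the standard error is 0.05 with 100 persons and roughly twice as large with groups of 21 to 27 persons, which is the reason given above for preferring the larger reference group.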




[Figure 3; horizontal axis: z-values corresponding to solution frequencies]

Fig. 3 Means of estimates related to standard scores (z-values).
Diagram A shows data from "Number series", Diagram B
those from "Levers", and Diagram C those from "Synonyms".

The slopes of the fitted regression lines are 1.4, 1.5, and 1.2,
the Pearson coefficients of correlation are 0.94, 0.80, and 0.82, and
the Spearman coefficients of rank-order correlation between the
rank order based on the mean estimates and the rank order according
to the z-values are 0.95, 0.80, and 0.84 for the data in Diagrams
A, B, and C, respectively. Rank-order coefficients of correlation
of 0.90 between averaged estimates of difficulty and z-values as
well as a by and large linear relationship have been found in similar
investigations (Borg & Forsling, 1964, 1965; Bratfisch et al., 1972).
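The slope and the product-moment correlation for "Number series" can be recomputed from the values printed in Table 1. The sketch below assumes an ordinary least-squares regression of the mean estimates on the z-values. The report does not state which fitting procedure was used, so this is an assumption, and the function name is hypothetical.

```python
from math import sqrt

# z-values and mean estimates for items 1 to 25 of "Number series" (Table 1).
Z = [
    -1.405, -2.326, -1.341, -0.643, -0.332, -0.706, 0.176, -0.878,
    -0.842, -0.306, -0.253, -0.675, 0.306, 0.332, -0.050, 1.175,
    0.954, 0.306, 1.555, 1.282, 2.326, 2.326, 1.341, 2.054, 2.326,
]
M = [
    1.85, 1.48, 2.63, 3.44, 3.44, 4.48, 5.33, 4.30,
    4.93, 5.00, 5.56, 4.67, 6.19, 5.15, 5.96, 7.11,
    6.67, 6.26, 7.63, 7.11, 7.33, 7.70, 7.67, 8.15, 8.78,
]


def fit_line(x, y):
    """Least-squares slope, intercept, and Pearson r of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    slope, intercept, r = fit_line(Z, M)
    print(round(slope, 2), round(intercept, 2), round(r, 2))
```

Under this assumption the Table 1 data should give a slope and a correlation close to the values of 1.4 and 0.94 reported for Diagram A.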

Apart from the close relationship between the sets of data, all the
diagrams in Figure 3 show a close correspondence between the widths
and the levels of the ranges of the estimates on the one hand and the
corresponding widths and levels of the z-ranges on the other hand. An
"objectively" medium difficult item is also estimated as medium
difficult even if it is presented first when estimating, as the results
from e.g. "Levers" show. A rather difficult item according to solution
frequencies is also estimated as rather difficult (and not e.g. as very,
very difficult) even if it is presented as one of the last items to be
judged, as the results from both "Levers" and "Synonyms" show.

The striking differences in estimates between the three tests are 
hardly due to differences between the experimental groups, as the 
results from the following analysis show. Means of the experimental 
estimates of those 8 subjects participating in all three experiments 
were computed. These values are plotted against z-values in Figure 4. 
Diagram A shows again data from "Number series", Diagram B those 
from "Levers", and Diagram C those from "Synonyms". 
Fig. 4 Means of estimates of 8 subjects participating in all three
experiments as related to standard scores (z-values). Diagram
A shows data from "Number series", Diagram B those from
"Levers", and Diagram C those from "Synonyms".

The results shown in Figure 4 are almost identical with those
shown in Figure 3. In Figure 4 the slopes of the fitted regression
lines are 1.5, 1.4, and 1.2, the Pearson coefficients of correlation
are 0.95, 0.81, and 0.80, and the Spearman coefficients of
rank-order correlation between the rank order based on the mean estimates
and the rank order according to the z-values are 0.91, 0.82, and
0.87 for the data in Diagrams A, B, and C, respectively. The very
close correspondence between the results based on only 8 subjects and
the results obtained by the total experimental groups is remarkable
and implies a high degree of reliability of the obtained trends.

Another interesting aspect, aimed at illuminating how estimates
of difficulty come about, can be reported in connection with the test
"Number series", where experimental conditions additional to those
described above were given. The items in "Number series" are not
of multiple choice character like the items in "Levers" and "Synonyms",
which was obviously the reason why certain subjects did not give any
answers to certain items in the first-named test (this phenomenon
never occurred in the other two tests). This happened altogether
25 times and every time subjects estimated the difficulty as "9"
(very, very difficult). Whenever this was the case the experimenter
explained to the subject how to solve the item in question and asked
him anew to estimate the difficulty of the item.* As a new response
"7" (difficult) was given 8 times, "8" (very difficult) 14 times, and
only 4 times the answer remained unchanged at "9" (very, very
difficult). It would thus seem that those subjects who could not give an
answer to an item responded purely in accordance with the experience
of their own performance capacity ("If I can't solve this item it must
be very, very difficult"). As soon as subjects have alternatives (as
in the tests "Levers" and "Synonyms" through their multiple choice
character) or are given one (as was the case in "Number series" when
an item remained unsolved) there seems to be a tendency to estimate
the difficulty of items which one actually cannot solve as somewhat
lower than very, very difficult (there is a known chance of picking
the right alternative, or the explanation is understood and causes a
slight drop in estimated difficulty).
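The size of this drop can be summarized from the response counts given above. The sketch below simply tallies the counts as printed. Note that they sum to 26, whereas the text speaks of 25 occasions, so the resulting mean should be read as approximate. The variable names are hypothetical.

```python
# New estimates given after the solution had been explained
# (category -> number of times, as printed in the text).
REVISED_COUNTS = {7: 8, 8: 14, 9: 4}

total = sum(REVISED_COUNTS.values())
mean_revised = sum(cat * n for cat, n in REVISED_COUNTS.items()) / total
share_lowered = (total - REVISED_COUNTS[9]) / total

if __name__ == "__main__":
    print(total)                    # number of tallied responses
    print(round(mean_revised, 2))   # mean of the revised estimates
    print(round(share_lowered, 2))  # share of estimates lowered below "9"
```

On these counts the revised estimates average a little below 8, i.e. about one category lower than the original estimate of 9.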

Other interesting features seen in Figures 3 and 4 are the slopes
of the fitted regression lines. It is likely that knowledge about how and
at what rate estimated difficulty increases with corresponding
measurements based on performance could be useful for test construction
purposes together with other "subjective" and "objective"
measurements.

In the further proceedings of the data analysis, the subjects 
participating in the three experiments were classified into subgroups 
homogeneous with respect to sex, age and performance on the test. 
The data of the various subgroups showed by and large the same 
general trends as have been described in the preceding sections; 
thus they need not be presented. 

Another way to detect possible differences between the estimates of
subjects is to calculate, for each item in each test, the mean
estimate separately for subjects who actually solved the item
correctly and for subjects who did not. Only items with at least 25% of
observations in either of the categories "solved" or "unsolved" were
considered. Figure 5 shows such means plotted against the order of
items in the tests. Diagram A in Figure 5 represents data from
"Number series", Diagram B those from "Levers", and Diagram C those
from "Synonyms".

When examining the diagrams it is seen that subjects who solved
the items of "Number series" and "Levers" correctly estimated the
difficulty of the items throughout as lower than those subjects who
did not, while no such tendency can be noticed in Diagram C,
representing "Synonyms". The means of estimates of the two groups
in each test were calculated. The differences between the central
tendencies thus obtained (0.98 in "Number series", 0.78 in "Levers",
and 0.13 in "Synonyms") were statistically significant at the 1% level
for "Number series" (t = 2.982, df = 28) and "Levers" (t = 2.888,
df = 26) but not for "Synonyms". The reason why the estimates of the
subjects with the "better" and the "poorer" performance do not differ
with respect to the test "Synonyms" is probably found in the facts that
this test consists of many "subjectively" easy items (the first 19
items out of 30 are estimated as less than medium difficult) and that
the performance on the test is also high throughout.

* These repeated estimates were not included when calculating the
statistics presented in this report.




[Figure 5; horizontal axis: real item sequence in the test]

Fig. 5 Mean estimates of subjects who actually solved an item
correctly (open circles) as well as mean estimates of subjects who
did not (filled circles) plotted against the real order of
occurrence in the tests. Diagram A shows data from "Number
series", Diagram B those from "Levers", and Diagram C
those from "Synonyms".

The main result of this study - the close and linear relationship
between estimated difficulty of intellectual tasks and corresponding
measurements based on performance - has now to be regarded as a
rather well-established fact. The differences in the estimates of
difficulty between the three tests seem mainly to be due to the different
[illegible] of objective item difficulty within each test. The number of
alternatives is probably the most important cue when estimating
the difficulty of an item which one would not be able to answer when
[illegible] alternatives seem to cause a drop
in estimated difficulty in such cases. It would be interesting to
investigate to what degree a change in the number of alternatives
in a given test causes differences in the estimates of difficulty and in
the solution frequencies.